Method for measuring a degree of association for dimensionally referenced data

ABSTRACT

A method for measuring a degree of association (58) between, and for selectively creating a grouping (40) of, n plurality of spatially referenced physical events (22) of a predetermined physical characteristic. The method includes the steps of assembling n plurality of physical events (26), assembling a universe of possible sample locations (36), determining a reference distribution (54), determining a restricted distribution (56), and determining the degree of association (58) between the n plurality of physical events. Specifically, the physical events each have an indicia of location and a physical characteristic above a threshold. The step (54) of determining a reference distribution is conducted by calculating a test statistic (78) for each of n′ plurality of random allocations (74) of the n plurality of physical events over the selected n plurality of sample locations. Further, the step (56) of determining a restricted distribution includes calculating the test statistic (82) for each of n″ plurality of restricted random allocations (80) of the n plurality of physical events over the n plurality of sample locations (62).

SPONSORSHIP

This invention was made with government support by the United States of America under Grant No. R43 CA65366 awarded by the National Cancer Institute. The United States government has certain rights in the invention.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates generally to a method for detecting statistically significant dimensional relationships between physical events and, more particularly, to a method and apparatus for measuring a degree of association between spatially referenced physical events in order to group these physical events when appropriate.

2. Discussion

It is difficult to overstate the importance of accurately measuring localized occurrences of a physical event such as, for example, disease outbreaks or unsafe pollution concentrations. Incorrectly identifying the existence of such an event can lead to unnecessarily alarming individuals in the “affected” area as well as to causing the expenditure of resources, monetary and otherwise, better allocated elsewhere. Potentially even more devastating is the failure to recognize the existence of a localized event as early as possible. Left unchecked, highly contagious diseases can spread to cause an epidemic while environmental conditions such as increased pollution levels can irrevocably damage fragile natural balances.

Physical events are often spatially or temporally related with localized occurrences referred to as clusters. In these instances, the ability to accurately determine the existence of localized occurrences of physical events depends in part upon the specificity of the spatial or temporal property of each event. Despite the need to accurately measure statistically significant clustering in a variety of contexts, currently available modeling techniques do not accurately reflect the location of each event and therefore too often lead to incorrect inferences. This problem is particularly troublesome in spatially referenced physical events having uncertain spatial locations.

Sources of location uncertainty arise in a variety of contexts. For example, uncertainty can arise in an epidemiologic context due to the anonymity commonly maintained during the reporting of health events, the uncertainty of exposure locations given the mobility of human activity, and the transient nature of many environmentally transmitted disease-causing agents. Uncertainty is amplified by the recording of event locations based upon zip code zones, census tracts, or grid nodes. Location uncertainty is also prevalent in the analysis of other spatially referenced physical events such as in the environmental and physical sciences (e.g. biology, geology, and hydrology).

Randomization testing of recorded events is commonly used to infer whether a spatial pattern exists within the sample of spatially referenced physical events. In these tests, the statistical significance of the spatial pattern is generally evaluated through the use of actual or estimated sample locations. When the actual locations of the samples are uncertain, a model is used to approximate the locations of the samples. The most frequently used method for approximating the location of a sample is the centroid model, which assigns the area centroid location to all cases or samples occurring within an area.

A particular disadvantage of using randomization tests based upon centroid approximations is that the approach does not consider the spatial distribution of the at-risk population. As a result, approximations based upon the centroid of an area rather than the distribution of the at-risk population create an unnecessarily and inaccurately small universe of possible sample locations. For example, in epidemiological analyses, the universe of sample locations for randomization is more properly related to the geographic distribution of the human population in general and, more particularly, to the distribution of individuals at risk for a particular disease.

Additionally, randomization tests are problematic for spatial data because currently used techniques assume that the sampling space consists of the locations at which the observations were made. That is, they erroneously assume that the universe of possible locations consists entirely and solely of the sample locations. However, in most situations, other locations in the study area could have been sampled. As a result, the sampling space for the spatial randomization test is incorrectly specified and the distributions generated during the test pertain only to the sample locations rather than the at-risk population within the study area. This incorrect approximation leads to detection errors and the potentially dire consequences associated therewith, whether the locations of the physical events are certain or uncertain.

Accordingly, it is an object of the present invention to provide a method for accurately determining the degree of association between physical events.

A further object of the present invention is to provide a method for accurately determining the degree of association between physical events having uncertain locations.

Another object of the present invention is to determine the degree of association between a plurality of spatially referenced physical events based upon an analysis of reference and restricted distributions of a test statistic.

Still another object of the present invention is to determine the relative degree of illness for a given area based upon a comparison of the degree of association between the physical events to a threshold value.

A further object of the present invention is to determine the degree of association between a plurality of spatially referenced physical events through the use of a location model that reflects the spatial distribution of an at-risk population.

SUMMARY OF THE INVENTION

The present invention provides a method for measuring a degree of association between, and for selectively creating a cluster of, n plurality of spatially referenced physical events of a predetermined physical characteristic. The method includes the steps of assembling n plurality of physical events, assembling a universe of possible sample locations, determining a reference distribution, determining a restricted distribution, and determining the degree of association between the n plurality of physical events. Specifically, the physical events each have an indicia of location and a physical characteristic above a threshold. The step of determining a reference distribution is conducted by calculating a test statistic for each of n′ plurality of random allocations of the n plurality of physical events over the selected n plurality of sample locations. Further, the step of determining a restricted distribution includes calculating the test statistic for each of n″ plurality of restricted random allocations of the n plurality of physical events over the n plurality of sample locations.

BRIEF DESCRIPTION OF THE DRAWINGS

Further objects, features and advantages of the present invention will become apparent to those skilled in the art from studying the following detailed description and the accompanying drawings, in which:

FIG. 1 is a schematic illustration of a method according to the present invention shown in relation to the physical events of interest;

FIG. 2 is a flow chart showing steps of the preferred method with reference to a hypothetical spatial area of analysis; and

FIG. 3 is a graphic illustration of an exemplary reference distribution and restricted distribution of the Mantel test statistic.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The following description of a preferred embodiment of the present invention is merely exemplary in nature and is not intended to unduly limit the scope of the claimed invention. Moreover, the following description, while depicting the invention for use with spatially referenced epidemiological events, is intended to adequately teach one skilled in the art to use the method and apparatus to measure a degree of association between spatially referenced physical events, particularly those events having uncertain locations, regardless of the underlying nature or characteristic of the event. Specifically, those skilled in the art will appreciate that the method and apparatus described and claimed herein is applicable to determine the degree of association for a variety of spatially referenced physical events including environmental events related to geology, hydrology, or pollution control.

FIG. 1 generally illustrates the implementation of the method and apparatus of the present invention 10 to a physical area 12 having a first sub-area 14 and a second sub-area 16. In this example, each of sub-areas 14 and 16 has a reporting station 18 and 20, respectively, where events of a physical characteristic are reported. The reporting of an epidemiological event to authorities, such as at a clinic, is commonly anonymous such that the residence of the person, presumably the best estimate of the location of contraction, is uncertain. This location uncertainty is represented in FIG. 1 by locating the event occurrences 22, identified by “+”, within reporting stations 18 and 20. The method and apparatus 10 of the present invention includes assembling the physical events of interest and determining a degree of association that represents a measurement of the proximity of the event occurrences within the physical area 12. More particularly, in an embodiment of the invention, the degree of association for the physical events is used to group the events that warrant further investigation or an immediate response.

FIG. 1 further illustrates a structure for implementing the present invention as well as the relationship of this structure to physical area 12. The physical events 26, each including a physical characteristic and an associated, though uncertain, physical location, are selected and assembled in a first data structure 32. In a similar fashion, a second data structure 34 is assembled to include the physical locations of a universe of possible sample locations 36 that, in the embodiment of the invention hereinafter described, is a model of the actual at-risk population or the spatial density thereof. Computer processor 38 communicates with first and second data structures 32 and 34, respectively, to retrieve selected physical characteristics and possible sample locations therefrom and determine the degree of association 24 between the physical events in the manner described below with reference to FIG. 2. A grouping 40 of physical events 26 is created when the determined degree of association exceeds a predetermined value. In the context of this description, the creation of grouping 40 identifies the physical events that have a spatial proximity to one another that is sufficient to warrant further investigation or intervention.

With continued reference to FIG. 1, the physical events 26 included in first data structure 32 include only those events having a physical location within the area of analysis, e.g. area 12 of FIG. 1, and an associated physical characteristic above a predetermined threshold. It should be appreciated that the predetermined threshold may be zero whereby physical events exhibiting the desired physical characteristic are included in first data structure 32 regardless of the degree of the characteristic. Further, as to second data structure 34, the number of potential samples in the universe generally exceeds the number of physical events in first data structure 32. This excess of possible samples allows the method and apparatus of the present invention to more accurately determine the degree of association between the physical events as hereinafter described. Those skilled in the art will realize from this description that the location models hereinafter described may be used to expand the universe of samples and thereby increase the accuracy of the degree of association without regard to whether the physical events have a certain or uncertain location.

In sum, the present invention determines a degree of association between events that is representative of the actual association between the events. While the events have an actual degree of association based upon the proximity of the event locations, the actual association is incapable of measurement because the events are reported with locations that are generally not reflective of the actual event locations. The determined degree of association is used to selectively create an event grouping that warrants further investigation or intervention activity, including action to mitigate or eliminate the condition represented by the physical characteristic of the reported event. As a result, the present invention modifies the location characteristics of the reported events to more accurately reflect actual conditions. The creation of an artificial group that more closely represents actual conditions is illustrated in FIG. 1 by grouping 40 compared to reporting locations 18 and 20.

It should be noted that while the descriptions and illustrations herein relate specifically to spatially referenced physical events occurring within an area, the invention is equally applicable to physical events associated with one another in other dimensions and occurring within a selected zone, e.g. temporal relationships within a time interval.

FIG. 2 illustrates a specific exemplary application of the present invention with respect to an area of analysis 12′ that includes a first sub-area 14′ and a second sub-area 16′ as previously described. In this example, each of the five selected physical events 26 illustrated in FIG. 1 is represented by an “x” and referenced by numerals 42, 44, 46, 48, and 50. Further, in this example, each of physical events 26 is further represented by data in the form (x_i, y_i, z_i), where x_i and y_i are geographic coordinates representative of the location of the physical event and z_i represents the physical characteristic of the event. While the locations of the physical events are illustrated as distributed within area of analysis 12′, in the illustrated embodiment of the invention, the exact locations of the physical events are known only to the extent that events 42 and 44 occurred within first sub-area 14 and events 46, 48, and 50 occurred within second sub-area 16.

Selected physical events 42, 44, 46, 48, and 50 are assembled in first data structure 32 and a universe of sample locations 52 within area of analysis 12′ is assembled within second data structure 34. In the present invention, universe 52 preferably includes the exact locations of the at-risk population within the respective sub-areas or the density distribution of the at-risk population within the area. While the preferred models for generating universe 52 are described in detail hereinafter, those skilled in the art will appreciate that universe 52 is based upon a measured or estimated distribution of a selected population as a whole, or a specified portion thereof.

Population density information is presently available from which the sampling space or universe can be assembled to include individuals particularly at risk for a certain health event. For example, the universe of sample locations may include the population density of individuals of a childhood age when determining the presence of statistically significant clustering of cases of childhood leukemia. Alternatively, the universe of possible sample locations could include the at-risk population density in view of the probability that an individual in the general population would be exposed to a condition conducive to contracting a particular disease.

FIG. 2 further schematically illustrates the steps performed, such as by computer processor 38 shown in FIG. 1, in determining the degree of association between physical events 42, 44, 46, 48, and 50. Specifically, the steps include generating a reference distribution 54, a restricted distribution 56, and determining a degree of association 58. The generation of reference and restricted distributions 54 and 56, respectively, includes the step of generating sample locations 60 and 62, respectively, from the universe of possible sample locations in second data structure 34. The generated sample locations are equal in number to the physical events 26 included in first data structure 32. Those skilled in the art will appreciate from this description as well as the appended claims and drawings that virtually any number, n, of physical events 26 and sample locations may be selected for use with the present invention. However, in the preferred embodiment, the number, x, of possible sample locations within first sub-area 14′ selected from second data structure 34 is equal to the number, x, of selected physical events 26 located in first sub-area 14′. Similarly, the number, y, of possible sample locations within second sub-area 16′ selected from second data structure 34 is equal to the number, y, of selected physical events located in the second sub-area. Accordingly, in the illustrated embodiment shown in FIG. 2, each generation of sample locations 60 and 62 includes two selected locations from first sub-area 14′ and three selected sample locations from second sub-area 16′.

As generally indicated in FIG. 2, the generation of reference distribution 54 further includes randomly allocating, step 74, the physical characteristics z_i of each selected physical event over all of the generated sample locations within area of analysis 12′. Specifically, each of the physical characteristics, i.e., z₁, z₂, …, z₅, are assigned with equal probability to each of the sample locations without regard to whether the sample locations are from the first or second sub-areas 14′ and 16′, respectively. This random allocation corresponds to a statistical null hypothesis of no association between the physical characteristics z_i and their locations (x_i, y_i) within the respective sub-areas. A test statistic is then calculated in step 78 for each fully randomized allocation of the physical characteristics, z_i, over the repeatedly generated sample locations. In summary, reference distribution 54 is generated by repeatedly generating sample locations (step 60), randomly allocating the physical characteristics over the sample locations (step 74), and calculating the test statistic (step 78). Those skilled in the art will appreciate that the number of repetitions, n′, performed to generate reference distribution 54 is represented by index of randomization k′, and is variable and dependent upon a number of factors including the number of selected physical events, i.e. events 42, 44, 46, 48, and 50, and the significance level one wishes to resolve.
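By way of illustration only, the loop of steps 60, 74, and 78 might be sketched as follows. The function names, the coordinates, the z values, and the stand-in statistic (pairwise distance multiplied by the absolute difference in z) are hypothetical choices made for this sketch and are not taken from the specification; sampling without replacement is likewise an assumption.

```python
import math
import random

def pair_statistic(locations, z):
    """Stand-in test statistic (an assumption): sum over all pairs of
    spatial distance times absolute difference in z."""
    n = len(locations)
    return sum(
        math.dist(locations[i], locations[j]) * abs(z[i] - z[j])
        for i in range(n)
        for j in range(n)
    )

def reference_distribution(universe, counts, z, repetitions, rng):
    """Repeat: generate sample locations (step 60), fully randomize z
    over them (step 74), and calculate the statistic (step 78)."""
    values = []
    for _ in range(repetitions):
        locations = []
        for subarea, count in counts.items():
            # Draw as many locations per sub-area as there are events in it.
            locations.extend(rng.sample(universe[subarea], count))
        shuffled = list(z)
        rng.shuffle(shuffled)  # allocation ignores sub-area membership
        values.append(pair_statistic(locations, shuffled))
    return values

# Hypothetical universe of possible sample locations for two sub-areas.
universe = {
    "14'": [(0.5, 1.0), (1.0, 2.5), (1.5, 0.5), (2.0, 2.0)],
    "16'": [(5.0, 1.0), (5.5, 2.5), (6.0, 0.5), (6.5, 2.0), (7.0, 1.5)],
}
counts = {"14'": 2, "16'": 3}      # two events in 14', three in 16'
z = [1.0, 2.0, 7.0, 8.0, 9.0]      # hypothetical z_1 .. z_5
reference = reference_distribution(universe, counts, z, 200, random.Random(0))
```

The number of repetitions (200 here) corresponds to n′ and would in practice be chosen with regard to the significance level to be resolved.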

In general, it is convenient to express the test statistic as a cross product (a Γ product) although this invention applies generally to all statistics calculated from spatially referenced data. More specifically, for spatially referenced data, the Γ product is:

$$\Gamma = A \otimes B = \sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij} b_{ij} \qquad (1)$$

where “N” is the number of locations, “a” is a proximity measure and “b” is calculated from the observations on z.
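As a minimal sketch of equation (1), the Γ product can be computed from two N×N matrices as shown below. The function name `gamma_product` and the example matrices are hypothetical and serve only to illustrate the double summation.

```python
def gamma_product(a, b):
    """Sum of a[i][j] * b[i][j] over all i and j, as in equation (1)."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))

# Hypothetical 3x3 example: `a` holds proximities, `b` holds values
# calculated from the observations on z.
a = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
b = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]
gamma = gamma_product(a, b)  # 2 * (1*3 + 2*1 + 1*2) = 14
```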

Those skilled in the art will recognize that the null hypothesis for this cross-product statistic is that observations on z are independent of proximity. The alternative hypothesis is that observations on z are in some way associated with proximity.

There are three general measures of proximity that provide a ready means for quantifying spatial relationships in the Γ product. As is known in the art, these proximity measures quantify spatial and/or temporal relationships between pairs of points and are of three basic types: adjacency, distance, and nearest neighbor. Those skilled in the art will also appreciate the advantages and disadvantages of each of these three types as well as that each may be used in the method described and claimed herein. For exemplary purposes, adjacency based statistics such as join-count and Moran's I statistics may be used for adjacency based analysis whereas Mantel's test for distance-based and Cuzick and Edwards' test for nearest neighbor-based analysis may also be used. The equations for these analyses are generally recognized in the art. For completeness, Mantel's distance-based cross product statistic for space-time clustering is:

$$T = \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij} t_{ij} \qquad (2)$$

where s_ij and t_ij are the spatial and temporal distances, respectively, between cases i and j.
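Equation (2) might be evaluated as in the following sketch, which assumes that each case is given as a hypothetical (x, y, t) tuple, that s_ij is the Euclidean distance between the case locations, and that t_ij is the absolute difference between the case times. These representations are assumptions of the sketch rather than requirements of the method.

```python
import math

def mantel_statistic(cases):
    """Sum of s_ij * t_ij over all pairs of cases, as in equation (2).

    `cases` is a list of (x, y, t) tuples (a hypothetical layout).
    The diagonal terms contribute zero because s_ii = t_ii = 0.
    """
    total = 0.0
    for xi, yi, ti in cases:
        for xj, yj, tj in cases:
            s_ij = math.hypot(xi - xj, yi - yj)  # spatial distance
            t_ij = abs(ti - tj)                  # temporal distance
            total += s_ij * t_ij
    return total

# Three hypothetical cases: two at the same place, one 5 units away.
cases = [(0.0, 0.0, 0.0), (3.0, 4.0, 2.0), (0.0, 0.0, 5.0)]
t_stat = mantel_statistic(cases)  # 2 * (5*2 + 0*5 + 5*3) = 50
```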

As shown in FIG. 2, spatially restricted distribution 56 is generated through restricted randomization, i.e., the physical characteristics, z_i, are randomly allocated (step 80) among the sample locations within the same sub-area. For example, after generating an appropriate number of sample locations in step 62, characteristics z₃, z₄, and z₅ from physical events 46, 48, and 50 are allocated over the three sample locations within second sub-area 16′. Likewise, the z₁ and z₂ characteristics from events 42 and 44 are randomly allocated over the two selected sample locations within first sub-area 14′. By this restricted allocation, the association between the z_i characteristics and the respective sub-areas for the physical events is maintained. The test statistic, Γ, is calculated in step 82 for each of the n″ generations of sample locations, thereby yielding the restricted distribution 56 of the test statistic. In a manner similar to reference distribution 54, the number, n″, of repetitions used to calculate the test statistic for restricted distribution 56 is represented by index of restricted randomization k″ and may vary depending upon, among other factors, the physical characteristics of interest. In one embodiment of the present invention the number of repetitions for the reference distribution, n′, and number of repetitions for the restricted distribution, n″, are equal.
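The restricted allocation of step 80 can be illustrated with the short sketch below, in which the z values are shuffled only within their own sub-area. The function name, the dictionary layout, and the z values are hypothetical.

```python
import random

def restricted_allocation(z_by_subarea, rng):
    """Shuffle the z values within each sub-area only (step 80), so that
    each value stays associated with the sub-area of its event."""
    allocated = {}
    for subarea, values in z_by_subarea.items():
        shuffled = list(values)
        rng.shuffle(shuffled)
        allocated[subarea] = shuffled
    return allocated

# Hypothetical characteristics: z_1, z_2 in sub-area 14'; z_3..z_5 in 16'.
z_by_subarea = {"14'": [1.0, 2.0], "16'": [7.0, 8.0, 9.0]}
allocation = restricted_allocation(z_by_subarea, random.Random(0))
```

Each shuffled list would then be paired, position by position, with the sample locations generated for the same sub-area in step 62 before the test statistic is calculated in step 82.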

A graphic illustration of reference distribution 54 and restricted distribution 56 is shown in FIG. 3. The relative positions of the distributions are the basis for determining the degree of association between physical events 26. More particularly, as shown in FIGS. 2 and 3, determination of the degree of association 58 includes selecting a critical value 83 and determining the credibility 84 of the null hypothesis. The selection of critical value 83 represents a trade-off between the acceptability of receiving false positives, i.e. incorrectly rejecting the null hypothesis (type I error), and obtaining false negatives, i.e. accepting the null hypothesis when it is false (type II error). As a result, the selection of critical value 83 is dependent upon the particular event under scrutiny given the uncertain locations. For example, in an epidemiological context, a disease that is highly contagious or fatal generally deserves a high degree of scrutiny and therefore a low critical value is selected so that any suggestion of spatial association leads to the creation of an event group and further action or investigation. Conversely, if false positives have severe consequences, a higher critical value may be used. Critical value 83 is generally expressed as a 1-α value and is often approximately 95% (α=0.05).

Credibility 84 describes the possibility of statistically significant clustering and is defined as the proportion of the restricted distribution that meets or exceeds critical value 83. Credibility 84 represents the degree of association between physical events 26 and can be used to determine the statistical significance of the dimensional proximity of the events and whether the association warrants creation of a grouping 92 (FIG. 2). Specifically, if n plurality of physical events exhibit a credibility over a predetermined threshold, grouping 92 is created to encompass these events thereby indicating that the dimensional relationship between the physical events is statistically significant. In general, it is contemplated that a credibility of greater than 0.05 is statistically significant for most characteristics.
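The determination of credibility 84 might be sketched as follows. The sketch assumes that critical value 83 is taken as the 1-α quantile of the reference distribution, which is one reading of the description above; the function names and the example distributions are hypothetical.

```python
import math

def critical_value(reference, alpha=0.05):
    """Assumed reading: the value at the (1 - alpha) quantile of the
    reference distribution, so at most a fraction alpha exceeds it."""
    ordered = sorted(reference)
    rank = math.ceil((1 - alpha) * len(ordered))
    return ordered[min(rank, len(ordered)) - 1]

def credibility(restricted, critical):
    """Proportion of the restricted distribution that meets or exceeds
    the critical value."""
    return sum(v >= critical for v in restricted) / len(restricted)

# Hypothetical distributions of the test statistic.
reference = list(range(1, 101))
restricted = [90, 94, 97, 99, 120, 130, 60, 70, 98, 101]
cv = critical_value(reference, alpha=0.05)
cred = credibility(restricted, cv)
create_grouping = cred > 0.05  # threshold contemplated in the description
```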

To this point, reference has been made to randomly sampling a universe of possible sample locations such as the at-risk population within an area in order to determine sample locations for the allocation of the physical characteristics. While a multitude of options are available to generate the sample space, four specific modeling alternatives are contemplated for use with the present invention. More particularly, the point model 86, population model 88, and polygon model 90 described in detail below and shown in FIG. 2 correspond to different levels of knowledge regarding spatial locations and are designed for situations commonly encountered in public health practice and the environmental sciences. Further, risk model 91 reflects not only the distribution of a general population but also the probability that the members of the population will contract a certain event of interest.

It will be apparent to one skilled in the art from the following description that the accuracy of the information obtained from point model 86 is superior to that from the population and polygon models, while the population model provides the next best results. Further, for simplicity of exposition, the location models are described below in the context of human population data. However, it should be appreciated that the models also apply in the earth sciences and other fields with little modification.

Point models 86 are used when a finite list of alternate exact locations is available. This situation arises, for example, when one knows a case occurred within a specific census district, but the exact place of residence of the case is not known. In point models, the list of alternate locations is constructed as the coordinates of all places of residence within the census district and can be obtained from a variety of methods including aerial photography, topographic maps showing building locations, and address-matching software that outputs the latitude and longitude of street addresses. Point model 86 is preferred because it offers the greatest spatial resolution, while its greatest weakness is that a list of alternative locations may be difficult to construct or be unavailable.

Population models 88 are used when the underlying population density distribution is known. This density distribution is then used to allocate possible case locations within an area. For example, locations with high population density are sampled most frequently. The resolution of the population model depends on the available population density surface, which is generally obtainable from sources such as the population density data produced by the Global Demography Project. Population data is also available from other sources such as U.S. census files that report population in census tracts and blocks as well as state information systems that report population by sections. Population model 88 offers ease of use due to the availability of the population density data as well as the ease of data incorporation into software applications. However, the population model may be less appropriate for diseases which strike particular risk groups whose population density is not highly correlated with the total population density within the area.
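A population model of this kind can be approximated, purely as an illustrative sketch, by drawing grid cells with probability proportional to their population density. The cell coordinates, densities, and function name below are hypothetical, and drawing with replacement is an assumption of the sketch.

```python
import random

def sample_population_model(cells, densities, k, rng):
    """Draw k cell locations with probability proportional to density."""
    return rng.choices(cells, weights=densities, k=k)

# Hypothetical grid cells (centre coordinates) and population densities.
cells = [(0, 0), (1, 0), (2, 0)]
densities = [0.0, 10.0, 30.0]  # the first cell is uninhabited
draws = sample_population_model(cells, densities, 500, random.Random(0))
```

Under these weights the densest cell is expected to be drawn about three times as often as the middle cell, and the uninhabited cell is never drawn.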

Polygon models 90 assume that the probability of sampling is uniform within sub-areas and apply when alternative places of residence are unknown, precluding the use of point models, and information on population density is lacking, thereby excluding population models. The use of a polygon model is frequently warranted due to data insufficiencies. However, these models offer the least resolution as a result of the assumption that population density is homogeneous within the sub-areas.
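One way to draw locations uniformly within a sub-area, offered only as a sketch, is rejection sampling from the bounding box of the sub-area polygon combined with a ray-casting point-in-polygon test. The technique, the function names, and the triangular sub-area are assumptions introduced here for illustration.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going right."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def sample_polygon_model(polygon, k, rng):
    """Draw k points uniformly within the polygon by rejection sampling."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < k:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points

# Hypothetical triangular sub-area.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = sample_polygon_model(triangle, 50, random.Random(0))
```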

As described above, the point, population, and polygon models are point density functions describing the locations where events could have occurred. Risk model 91, on the other hand, recognizes that sampling probability at a location is dependent on two functions, i.e., the spatial density distribution of the physical events and the space-time process which propagates the physical characteristic of interest across the distribution. For example, for disease processes, the spatial density distribution of the physical events would generally describe the place of residence for individuals. Further, for a contagious disease, the second function would describe the transmission dynamics of the disease itself. Examples where the risk model would accurately represent the applicable universe of sampling locations include the study of malaria, where areas with small mosquito populations would have a small sampling probability even where the human population is dense. As a further example, individuals proximate to a forested area would be given a higher probability of contracting Lyme disease in view of the greater risk that they would come in contact with ticks. Further customization of the universe of possible sample locations may be provided in environmental applications such as pesticide exposures in agricultural areas, heavy metal transport, etc.
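As a simplified sketch of the risk model, the sampling weight at each location may be taken as the product of a population density and a relative exposure risk. Combining the two functions by multiplication, and all of the names and numbers below, are assumptions made for illustration; the specification does not prescribe a particular functional form.

```python
import random

def sample_risk_model(locations, density, risk, k, rng):
    """Draw k locations with probability proportional to density * risk
    (multiplicative combination is an assumption of this sketch)."""
    weights = [d * r for d, r in zip(density, risk)]
    return rng.choices(locations, weights=weights, k=k)

# Hypothetical locations: a dense town with no mosquitoes, and two
# sparser villages with differing mosquito abundance.
locations = ["village_a", "town", "village_b"]
density = [50.0, 900.0, 80.0]
risk = [0.3, 0.0, 0.6]
draws = sample_risk_model(locations, density, risk, 300, random.Random(0))
```

In this example the town is never drawn despite its dense population, consistent with the malaria example given above.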

Various other advantages and modifications will become apparent to one skilled in the art after having the benefit of studying the teachings of the specification, the drawings, and the following claims.

1-19. (canceled)
20. A method, comprising: establishing a set of occurrences of an event; identifying a pair of the occurrences having a proximity less than a predetermined value.
21. A method, as set forth in claim 20, wherein each occurrence includes an associated location.
22. A method, as set forth in claim 21, wherein the proximity of the pair of occurrences is a function of associated locations between each occurrence and another occurrence.
23. A method, as set forth in claim 22, wherein the proximity is based on at least one reference location.
24. A method, as set forth in claim 23, the reference location being defined by a model of an at-risk population.
25. A method, as set forth in claim 24, the model representing the spatial-density of the at-risk population.
26. A method, as set forth in claim 21, wherein each associated location is an estimate of the location of the occurrence.
27. A method, as set forth in claim 20, the event having a parameter, each occurrence having a value of the parameter, the method further comprising the steps of: comparing the value of the parameter of each occurrence with a second predetermined value; and, including, in the subset, occurrences whose value of the parameter exceeds the second predetermined value.